
A torchtext tutorial to pre-process a non-built-in dataset #2307


Merged: 34 commits merged into pytorch:main from anp-scp:add-torchtext-tutorial on May 18, 2023

Conversation

anp-scp (Contributor) commented May 4, 2023

This tutorial illustrates the usage of torchtext (0.15.0) on a dataset that is not built into torchtext.

This tutorial shows how to:

  1. Read a dataset using Torchdata 0.6.0
  2. Tokenize sentences using torchtext
  3. Apply transforms to sentences using torchtext
  4. Perform bucket batching using torchtext

cc @pytorch/team-text-core @Nayef211
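
For readers skimming this thread, step 2 above is built on spaCy-backed tokenizers obtained through torchtext's get_tokenizer helper. A minimal sketch of that step follows; the spaCy model names are the standard small English and German pipelines and are an assumption here, not a quote from the tutorial:

from torchtext.data.utils import get_tokenizer

# spaCy-backed tokenizers for the English and German sides of each pair.
# The small pipelines are assumed and must be downloaded beforehand, e.g.
# with `python -m spacy download en_core_web_sm`.
eng_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')

print(eng_tokenizer("Have a good day!"))  # ['Have', 'a', 'good', 'day', '!']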

netlify bot commented May 4, 2023

Deploy Preview for pytorch-tutorials-preview ready!

🔨 Latest commit: 6e78668
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/646664dd9c9e87000860ba6d
😎 Deploy Preview: https://deploy-preview-2307--pytorch-tutorials-preview.netlify.app

anp-scp (Contributor Author) commented May 10, 2023

Hi @pytorch/team-text-core, @Nayef211,

Could you please review and approve the pull request? If you have any feedback or suggestions, I would be grateful to hear them.

Best Regards

Nayef211 (Contributor) left a comment

@anp-scp thanks so much for contributing this detailed tutorial. I've left some nits around variable naming and sentence restructuring. Once these are addressed, I'm happy to accept and merge this tutorial.


Let us assume that we need to prepare a dataset to train a model that can perform English to
German translation. We will use a tab-delimited German - English sentence pairs provided by
the `Tatoeba Project <https://tatoeba.org/en>`_ which can be downloaded from this link: `Click
Contributor

nit: can we just add the download link to the "this link" text? "Click Here" comes off a bit phishy.

Contributor Author

I have made changes as suggested.

the `Tatoeba Project <https://tatoeba.org/en>`_ which can be downloaded from this link: `Click
Here <https://www.manythings.org/anki/deu-eng.zip>`__.

Sentence pairs for other languages can be found in this link:
Contributor

nit: same comment as above. Let's hyperlink the "this link" text and get rid of the line below.

Contributor Author

I have made changes as suggested.

Comment on lines 40 to 44
# * `Torchdata 0.6.0 <https://pytorch.org/data/beta/index.html>`_ (Installation instructions: `C\
# lick here <https://github.com/pytorch/data>`__)
# * `Torchtext 0.15.0 <https://pytorch.org/text/stable/index.html>`_ (Installation instructions:\
# `Click here <https://github.com/pytorch/text>`__)
# * Spacy (Docs: `Click here <https://spacy.io/usage>`__)
Contributor

nit: same as above. Get rid of "Click here" and hyperlink the text directly

Contributor Author

I have made changes as suggested.
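
As a side note, once the prerequisites above are installed (and the spaCy models downloaded with `python -m spacy download <model>`), a quick sanity check of the versions the tutorial targets might look like the sketch below; the expected values in the comments simply restate the versions listed in the quoted snippet:

import spacy
import torchdata
import torchtext

# Confirm the installed versions roughly match what the tutorial was written against.
print(torchdata.__version__)  # expected: 0.6.0
print(torchtext.__version__)  # expected: 0.15.0
print(spacy.__version__)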

Comment on lines 70 to 72
dataPipe = dp.iter.IterableWrapper([FILE_PATH])
dataPipe = dp.iter.FileOpener(dataPipe, mode='rb')
dataPipe = dataPipe.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)
Contributor

nit: let's use snake_case for variable names to be consistent with other tutorials. Change `dataPipe` to `data_pipe`.

Contributor Author

I have made changes as suggested.
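
For reference, with the rename applied the quoted snippet reads roughly as follows; the FILE_PATH value is a placeholder for the extracted tab-delimited sentence-pair file, not the tutorial's exact path:

import torchdata.datapipes as dp

# Placeholder path to the extracted tab-delimited sentence-pair file.
FILE_PATH = 'data/deu.txt'

# Wrap the path in a DataPipe, open the file, and parse each line into a
# tuple of tab-separated fields (one sentence pair per line).
data_pipe = dp.iter.IterableWrapper([FILE_PATH])
data_pipe = dp.iter.FileOpener(data_pipe, mode='rb')
data_pipe = data_pipe.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)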

Comment on lines 84 to 87
# Data pipes can be thought of something like a dataset object, on which
# we can perform various operations.
# Check `this tutorial <https://pytorch.org/data/beta/dp_tutorial.html>`_ for more details on
# data pipes.
Contributor

Change to DataPipes

Contributor Author

I have made changes as suggested. I have also added the words "DataPipe" and "DataPipes" to en-wordlist.txt.
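
To illustrate the quoted point that a DataPipe behaves much like a dataset object, the parsed pipe from the earlier sketch can simply be iterated:

# Each element of data_pipe is one parsed row of the file; printing the first
# element shows the raw sentence-pair tuple before any tokenization.
for sample in data_pipe:
    print(sample)
    break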

# We will build vocabulary for both our source and target now.
#
# Let us define a function to get tokens from elements of tuples in the iterator.
# The comments within the function specifies the need and working of it:
Contributor

This sentence doesn't add too much value. Let's remove.

Contributor Author

I have made changes as suggested.
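
To make the quoted passage concrete, a token-yielding helper plus torchtext's build_vocab_from_iterator is the usual recipe. The sketch below reuses data_pipe and the tokenizers from the earlier snippets; the min_freq value and the special tokens are illustrative assumptions, not necessarily the tutorial's exact choices:

from torchtext.vocab import build_vocab_from_iterator

def get_tokens(data_iter, place):
    # Yield one token list per sample, tokenizing the sentence at index
    # `place` of each tuple (index 0 is assumed to be the English column here).
    for sample in data_iter:
        tokenizer = eng_tokenizer if place == 0 else de_tokenizer
        yield tokenizer(sample[place])

# Build a vocabulary for the source language, keeping tokens seen at least
# twice and reserving special symbols for padding, sentence boundaries, and
# unknown words; the target vocabulary is built the same way with place=1.
source_vocab = build_vocab_from_iterator(
    get_tokens(data_pipe, 0),
    min_freq=2,
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True,
)
source_vocab.set_default_index(source_vocab['<unk>'])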

Comment on lines 240 to 241
# which we will use on our sentence. Let us take a random sentence and check the working of
# the transform:
Contributor

nit: rephrase to "and check how the transform works."

Contributor Author

I have made changes as suggested.

Comment on lines 257 to 261
# * At line 2, we take a source sentence from list that we created from dataPipe at line 1
# * At line 5, we get a transform based on a source vocabulary and apply it to a tokenized
# sentence. Note that transforms take list of words and not a sentence.
# * At line 8, we get the mapping of index to string and then use it get the transformed
# sentence
Contributor

Remove 2 spaces in front of these bullets so the indentation is consistent with the rest of the tutorial.

Contributor Author

I have made changes as suggested.
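
For readers who want to see the mechanics these bullets describe, the transform step looks roughly like the sketch below, which composes torchtext.transforms on top of the source_vocab and eng_tokenizer from the earlier sketches; the exact composition in the tutorial may differ:

import torchtext.transforms as T

# Compose a transform: map tokens to vocabulary indices, then add start- and
# end-of-sentence markers around the index sequence.
source_transform = T.Sequential(
    T.VocabTransform(source_vocab),
    T.AddToken(source_vocab['<sos>'], begin=True),
    T.AddToken(source_vocab['<eos>'], begin=False),
)

tokens = eng_tokenizer("Have a good day!")  # transforms take a list of words, not a sentence
indices = source_transform(tokens)          # list of vocabulary indices

# Map the indices back to strings to inspect the transformed sentence.
itos = source_vocab.get_itos()
print([itos[index] for index in indices])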

# * At line 8, we get the mapping of index to string and then use it get the transformed
# sentence
#
# Now we will use functions of `dataPipe` to apply transform to all our sentences.
Contributor

rephrase to "DataPipe functions"

Contributor Author

I have made changes as suggested.
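
As context for the sentence under discussion, applying the transform to every sample and then bucket batching (step 4 of the tutorial) goes through DataPipe functions such as map and bucketbatch. The sketch below reuses names from the earlier snippets and assumes a target_transform built from the German vocabulary in the same way as source_transform; the batch parameters are illustrative:

def apply_transform(pair):
    # Tokenize and transform one (English, German) pair into index sequences.
    return (
        source_transform(eng_tokenizer(pair[0])),
        target_transform(de_tokenizer(pair[1])),
    )

# Apply the transform lazily to every sample in the DataPipe.
data_pipe = data_pipe.map(apply_transform)

def sort_bucket(bucket):
    # Order samples within a bucket by length so padding inside a batch stays small.
    return sorted(bucket, key=lambda pair: (len(pair[0]), len(pair[1])))

# Group samples of similar length into batches of 4.
data_pipe = data_pipe.bucketbatch(
    batch_size=4, batch_num=5, bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sort_bucket,
)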

Comment on lines 381 to 383
# Some parts of this tutorial was inspired from this article:
# Link: `https://medium.com/@bitdribble/migrate-torchtext-to-the-new-0-9-0-api-1ff1472b5d71\
# <https://medium.com/@bitdribble/migrate-torchtext-to-the-new-0-9-0-api-1ff1472b5d71>`__.
Contributor

nit: directly link the hyperlink to "this article".

Contributor Author

I have made changes as suggested.

anp-scp (Contributor Author) commented May 16, 2023

Hi @Nayef211,

I have updated the tutorial as per the suggestions.

@anp-scp anp-scp requested a review from Nayef211 May 16, 2023 20:35
Nayef211 (Contributor) left a comment

LGTM, thanks for updating the tutorial with the suggestions. Will merge once all CI jobs complete! 😄

@Nayef211 Nayef211 merged commit 11aec45 into pytorch:main May 18, 2023
@anp-scp anp-scp deleted the add-torchtext-tutorial branch May 20, 2023 18:20